DATA 624 Project 1
Introduction
This report is intended for colleagues from a variety of backgrounds and contains both technical and non-technical explanations of the work conducted. The objective of this project was to forecast two of the five provided variables from each of six time series sets. We were provided a spreadsheet containing 1622 periods of every variable in every set and were asked to forecast the next 140 periods. The sets are labeled S01 through S06, and each contains variables labeled Var01, Var02, Var03, Var05, and Var07. The variables to forecast depend on the set, as specified below:
- S01 – Forecast Var01, Var02
- S02 – Forecast Var02, Var03
- S03 – Forecast Var05, Var07
- S04 – Forecast Var01, Var02
- S05 – Forecast Var02, Var03
- S06 – Forecast Var05, Var07
For each category, we perform the following:
- Exploratory Data Analysis
  - Visualize the data.
  - Identify trend, seasonality, stationarity, outliers, missing data, variance, etc. Interpret the ACF and PACF to inform the modeling approach.
- Data Preparation
  - Prepare the data for modeling: address outliers, missing values, stationarity, etc.
  - Split the data into training and testing sets.
- Data Modeling
  - Fit candidate models. We measure model accuracy with MAPE (Mean Absolute Percentage Error, defined below) and select the model with the lowest score.
- Data Export
  - Using the best model, predict the next 140 periods and write the forecasts to disk.
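For reference, over a test set of length $n$ with observed values $y_t$ and forecasts $\hat{y}_t$,

$$\text{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|$$

so lower values indicate more accurate forecasts.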
Data Import
We read in the data and create a separate data frame for each category, keeping only the first 1622 observations. The tables below show the first rows of the combined data and of the S01 data frame (s1).
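The import code itself is not reproduced in this report; a minimal sketch of how it might look, assuming the provided spreadsheet has been saved as data.xlsx (a placeholder file name) and using the readxl and dplyr packages:

library(readxl)
library(dplyr)

raw <- read_excel("data.xlsx") # placeholder file name for the provided spreadsheet

# One data frame per category, keeping only the first 1622 (observed) periods
s1 <- raw |> filter(category == "S01") |> slice_head(n = 1622)
s2 <- raw |> filter(category == "S02") |> slice_head(n = 1622)
s3 <- raw |> filter(category == "S03") |> slice_head(n = 1622)
s4 <- raw |> filter(category == "S04") |> slice_head(n = 1622)
s5 <- raw |> filter(category == "S05") |> slice_head(n = 1622)
s6 <- raw |> filter(category == "S06") |> slice_head(n = 1622)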
| SeriesInd | category | Var01 | Var02 | Var03 | Var05 | Var07 |
|---|---|---|---|---|---|---|
| 40669 | S03 | 30.64286 | 123432400 | 30.34000 | 30.49000 | 30.57286 |
| 40669 | S02 | 10.28000 | 60855800 | 10.05000 | 10.17000 | 10.28000 |
| 40669 | S01 | 26.61000 | 10369300 | 25.89000 | 26.20000 | 26.01000 |
| 40669 | S06 | 27.48000 | 39335700 | 26.82000 | 27.02000 | 27.32000 |
| 40669 | S05 | 69.26000 | 27809100 | 68.19000 | 68.72000 | 69.15000 |
| 40669 | S04 | 17.20000 | 16587400 | 16.88000 | 16.94000 | 17.10000 |
| 40670 | S03 | 30.79857 | 150476200 | 30.46428 | 30.65714 | 30.62571 |
| 40670 | S02 | 11.24000 | 215620200 | 10.40000 | 10.45000 | 10.96000 |
| 40670 | S01 | 26.30000 | 10943800 | 25.70000 | 25.95000 | 25.86000 |
| 40670 | S06 | 28.24000 | 55416000 | 27.24000 | 27.27000 | 28.07000 |
Head of the s1 data frame (S01):

| SeriesInd | category | Var01 | Var02 | Var03 | Var05 | Var07 |
|---|---|---|---|---|---|---|
| 40669 | S01 | 26.61 | 10369300 | 25.89 | 26.20 | 26.01 |
| 40670 | S01 | 26.30 | 10943800 | 25.70 | 25.95 | 25.86 |
| 40671 | S01 | 26.03 | 8933800 | 25.56 | 25.90 | 25.67 |
| 40672 | S01 | 25.84 | 10775400 | 25.36 | 25.60 | 25.75 |
| 40673 | S01 | 26.34 | 12875600 | 25.52 | 25.60 | 26.34 |
S01
par(mfrow = c(1,2))
s1 |>
select(c(Var01,Var02)) |>
ts() |>
autoplot(facets = TRUE, main = "S01") +
ylab('')

s1 |>
ggplot(aes(x = Var01, y = Var02)) +
geom_point()

## Warning: Removed 2 rows containing missing values (`geom_point()`).
Var01
Exploratory Data Analysis
v1 <- ts(s1$Var01)
ggtsdisplay(v1)

## Warning: Removed 2 rows containing missing values (`geom_point()`).
- The data is non-stationary
- There are two missing values
- The ACF is highly significant across many lags
- The PACF cuts off at lag 1
- There are some outliers though they are difficult to observe
- There is an upward trend
- Variance appears to be stable
- There is some cyclicity in the data. There does not appear to be any seasonality
- The data is non-linear
We can see the lags are highly correlated using a lagplot,
gglagplot(v1)

We can visualize the missing values to see where they appear within
the data using the imputeTS package, which we can later use
to impute the missing values using a moving average.
library(imputeTS)
v1 |>
ggplot_na_distribution()

Data Preparation
To deal with the outliers, we'll use the tsclean function from the forecast package, which flags observations that deviate strongly from a smoothed fit of the series and replaces them with values interpolated from neighboring observations.
For the missing values, we'll use the na_ma() function from the imputeTS package, which replaces each missing value with a weighted moving average of neighboring observations within a specified window.
library(forecast)
v1 <- v1 |>
tsclean(lambda = "auto") |>
na_ma() # default window k = 4

We take a first difference of the data to see if we can achieve stationarity,
v1.diff <- diff(v1)
ggtsdisplay(v1.diff)

There are still some outliers and a few significant spikes, but the ACF and PACF plots are informative: the ACF cuts off at lag 2, indicating an MA(2) component, and the PACF also cuts off at lag 2, indicating an AR(2) component.
Now we split the data into training and testing sets to prepare for modeling, setting aside 20% of the data for testing (with 1622 observations, roughly 1298 go to training and 324 to the test set).
library(zeallot) # provides tuple assignment via %<-%

train_test_split <- function(x, split) {
  split.index <- round(length(x) * split)    # observation index of the split point
  train <- window(x, end = split.index)      # ts index equals observation number here
  test <- window(x, start = split.index + 1)
  horizon <- length(test)                    # forecast horizon = size of the test set
  return(list(train, test, horizon))
}
c(train, test, horizon) %<-% train_test_split(v1, 0.8)

Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(v1 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(v1 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)

## ses holt
## v1 10.01436 10.49858

Box.test(residuals(holt.fit))$p.value

## [1] 0.02235277

Box.test(residuals(ses.fit))$p.value

## [1] 0.005648436

Both Box-Pierce p-values are below 0.05, indicating that autocorrelation remains in the residuals and neither model captures all of the structure in the data.
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data")

ETS Modeling
set.seed(123)
ets.fit <- ets(train, model = "MMN")
ets.fit

## ETS(M,M,N)
##
## Call:
## ets(y = train, model = "MMN")
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 1e-04
##
## Initial states:
## l = 27.0027
## b = 1.0007
##
## sigma: 0.0127
##
## AIC AICc BIC
## 7172.067 7172.114 7197.910
A model with multiplicative error, multiplicative trend, and no seasonality gives the best fit, but problems remain: the ACF still shows significant autocorrelation, so the model is not incorporating all of the information in the data. We can see this in the residuals,
checkresiduals(ets.fit)

##
## Ljung-Box test
##
## data: Residuals from ETS(M,M,N)
## Q* = 17.51, df = 10, p-value = 0.06382
##
## Model df: 0. Total lags used: 10
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(v1 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)

## ses holt ets
## v1 10.01436 10.49858 23.20971
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data")

ARIMA Modeling
An ARIMA(2,1,2) model with log transformation (lambda=0)
achieves the best MAPE.
set.seed(123)
arima.fc <- train |>
Arima(order=c(2,1,2), lambda=0) |>
forecast(h=horizon)
checkresiduals(arima.fc)

##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2)
## Q* = 7.7323, df = 6, p-value = 0.2584
##
## Model df: 4. Total lags used: 10
set.seed(123)
arima.acc <- c(v1 = accuracy(arima.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)

## ses holt ets arima
## v1 10.01436 10.49858 23.20971 9.994805
Results
s1v1.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)

Var02
Exploratory Data Analysis
v2 <- ts(s1$Var02)
ggtsdisplay(v2)

- This data has a lot of outliers
- There is a faint downward trend over time
- The variance in the data appears to decrease over time
- There is a strong autocorrelation among many lags
- There is no missing data
- The data is non-stationary
- There doesn’t appear to be any seasonality
gglagplot(v2)

Although the ACF plot shows a highly significant autocorrelation, the lags show an inverse correlation between the time series and its lagged values.
Data Preparation
We deal with the outliers and non-stationarity, applying a log transformation to address the inverse autocorrelation,
v2 <- v2 |>
tsclean(lambda = "auto") |>
log()
ggtsdisplay(diff(diff(v2)))

A second difference of the log-transformed data brings the series to stationarity. The ACF cuts off at lag 1, suggesting an MA(1) component, while the PACF tails off, which is consistent with an MA(1) component.
gglagplot(diff(diff(v2))) # v2 has already been log-transformed above

We can see the inverse correlations in the lags are no longer present.
Now we set aside our training and testing sets.
c(train, test, horizon) %<-% train_test_split(v2, 0.8)

Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)

## ses holt
## v2 2.168109 2.168723
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data")

ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit

## ETS(A,N,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.4238
##
## Initial states:
## l = 16.1856
##
## sigma: 0.311
##
## AIC AICc BIC
## 6276.973 6276.991 6292.478
checkresiduals(ets.fit)

##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 69.91, df = 10, p-value = 4.614e-11
##
## Model df: 0. Total lags used: 10
The p-value is far too small for this to be a well-specified model, but we'll make the predictions and store the results for comparison,
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)

## ses holt ets
## v2 2.168109 2.168723 2.168141
ARIMA Modeling
set.seed(123)
train |>
auto.arima(lambda = 0) |>
forecast(h = horizon) |>
checkresiduals()

##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,4)
## Q* = 13.84, df = 5, p-value = 0.01666
##
## Model df: 5. Total lags used: 10
The residual p-value is still too low for this to be considered a well-fitting model, but it's the best we've found so far,
set.seed(123)
arima.fc <- train |>
Arima(order=c(1,1,4), lambda=0) |>
forecast(h=horizon)
checkresiduals(arima.fc)

##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,4)
## Q* = 13.84, df = 5, p-value = 0.01666
##
## Model df: 5. Total lags used: 10
set.seed(123)
arima.acc <- c(v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc

## v2
## 2.118945
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data")

Results
s1v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S01\n\n")## MAPE results by model - S01
rbind(s1v1.results, s1v2.results)

## ses holt ets arima
## v1 10.014358 10.498584 23.209708 9.994805
## v2 2.168109 2.168723 2.168141 2.118945
S02
par(mfrow = c(1,2))
s2 |>
select(c(Var02,Var03)) |>
ts() |>
autoplot(facets = TRUE, main = "S02") +
ylab('')

s2 |>
ggplot(aes(x = Var02, y = Var03)) +
geom_point()

## Warning: Removed 4 rows containing missing values (`geom_point()`).
Var02
Exploratory Data Analysis
v2 <- ts(s2$Var02)
ggtsdisplay(v2)

- The data is non-stationary
- There is a very faint downward trend over time.
- There is high variance in the data
- There does not appear to be any seasonality in the data
- The ACF plot indicates the lags are highly significant and tail off over time, indicating an AR component; the PACF cuts off after lag 8, suggesting an AR(8) order
- There are many outliers in the data
- There are no missing values in the data
gglagplot(v2)

The lags show an inverse correlation.
Data Preparation
We deal with the outliers, then apply a log transformation and differencing to address the non-stationarity and inverse autocorrelation,
v2 <- v2 |>
tsclean(lambda = "auto")
ggtsdisplay(diff(log(v2)))

The data now appears stationary. The ACF cuts off at lag 2, indicating an MA(2) component, and the PACF cuts off at lag 9, indicating an AR(9) component. The inverse correlations in the lags are no longer present.
Now we set aside our training and testing sets.
c(train, test, horizon) %<-% train_test_split(v2, 0.8)

Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)

## ses holt
## v2 27.55139 27.56873
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)

## [,1]
## holt_pval 5.881639e-08
## ses_pval 7.954414e-08
The p-values are very low. These models do not fit the data well.
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")ETS Modeling
set.seed(123)
train.diff <- diff(log(train))
ets.fit <- ets(train.diff)
ets.fit

## ETS(A,N,N)
##
## Call:
## ets(y = train.diff)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = -5e-04
##
## sigma: 0.3676
##
## AIC AICc BIC
## 6704.535 6704.553 6720.038
Because we logged and differenced the training data, we need to back-transform the forecasts to put them on the same scale as the test data.
library(stats)
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
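# Back-transform: exponentiate the cumulative sum of the forecast log-differences.
# Note (assumption): this rebuilds levels relative to 1; anchoring to the last training
# observation, e.g. tail(train, 1) * exp(cumsum(ets.fc$mean)), would return the forecasts
# to the original scale of the series.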
ets.fc <- exp(cumsum(ets.fc$mean))
ets.acc <- c(v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)

## ses holt ets
## v2 27.55139 27.56873 100
checkresiduals(ets.fc)

##
## Ljung-Box test
##
## data: Residuals
## Q* = 2988, df = 10, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 10
We can see that this model also does not fit the data well.

ARIMA Modeling
set.seed(123)
train |>
auto.arima(lambda = 0) |>
forecast(h = horizon) |>
checkresiduals()

##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,3)
## Q* = 6.2503, df = 5, p-value = 0.2826
##
## Model df: 5. Total lags used: 10
set.seed(123)
arima.fc <- train |>
auto.arima(lambda=0) |>
forecast(h=horizon)
arima.acc <- c(v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc

## v2
## 28.31755
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")This model fits the data pretty well.
Results
s2v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
s2v2.results

## ses holt ets arima
## v2 27.55139 27.56873 100 28.31755
Var03
Exploratory Data Analysis
v3 <- ts(s2$Var03)
ggtsdisplay(v3)

## Warning: Removed 4 rows containing missing values (`geom_point()`).
- The data is non-stationary
- There is no noticeable trend over time
- There are four missing values
- There is one extreme outlier
- There does not appear to be any seasonality in the data
- The ACF plot indicates the lags are highly significant and tail off over time, indicating an AR component. The PACF cuts off at lag 1, indicating an AR(1) component.
gglagplot(v3)

The lags show a positive linear correlation over shorter periods of time and inverse correlation over longer periods.
Data Preparation
We deal with the outliers and impute the missing values,
v3 <- v3 |>
tsclean(lambda = "auto") |>
na_ma()
ggtsdisplay(v3)

Let's take a look at a difference of the data to see if it achieves stationarity,
ggtsdisplay(diff(v3))

The differenced series appears stationary. Now we set aside our training and testing sets.
c(train, test, horizon) %<-% train_test_split(v3, 0.8)

Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(v3 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(v3 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)

## ses holt
## v3 17.17073 17.28684
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)

## [,1]
## holt_pval 1.352437e-03
## ses_pval 5.046795e-05
The p-values are very low. These models do not fit the data well.
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit

## ETS(A,Ad,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 0.0266
## phi = 0.8214
##
## Initial states:
## l = 9.4723
## b = 0.4578
##
## sigma: 0.2565
##
## AIC AICc BIC
## 5779.563 5779.628 5810.574
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(v3 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)

## ses holt ets
## v3 17.17073 17.28684 17.28755
checkresiduals(ets.fc)

##
## Ljung-Box test
##
## data: Residuals from ETS(A,Ad,N)
## Q* = 29.746, df = 10, p-value = 0.0009426
##
## Model df: 0. Total lags used: 10
We can see from the low p-value that this model also does not fit the data very well,
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data") +
ylab("")ARIMA modeling
set.seed(123)
train |>
auto.arima(lambda = 0) |>
forecast(h = horizon) |>
checkresiduals()

##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,0)
## Q* = 11.722, df = 8, p-value = 0.1641
##
## Model df: 2. Total lags used: 10
set.seed(123)
arima.fc <- train |>
auto.arima(lambda=0) |>
forecast(h=horizon)

set.seed(123)
arima.acc <- c(v3 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc

## v3
## 17.29381
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")This model also does not fit the data well.
Results
s2v3.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S02\n\n")## MAPE results by model - S02
rbind(s2v2.results, s2v3.results)

## ses holt ets arima
## v2 27.55139 27.56873 100.00000 28.31755
## v3 17.17073 17.28684 17.28755 17.29381
S03
par(mfrow = c(1,2))
s3 |>
select(c(Var05,Var07)) |>
ts() |>
autoplot(facets = TRUE, main = "S03") +
ylab('')

s3 |>
ggplot(aes(x = Var05, y = Var07)) +
geom_point()

## Warning: Removed 4 rows containing missing values (`geom_point()`).
We can see that the two variables are very similar and have a strong positive correlation.
Var05
Exploratory Data Analysis
v5 <- ts(s3$Var05)
ggtsdisplay(v5)

## Warning: Removed 4 rows containing missing values (`geom_point()`).
- The data is non-stationary.
- There is an upward trend over time.
- There is some cyclicity, but no discernible seasonality.
- The ACF plot tails off over time, indicating significance in the autocorrelation and presence of an AR component. The PACF cuts off at lag 2, indicating an AR(2) component.
- There are some missing values.
- There are some outliers present.
gglagplot(v5)

The lag plot indicates a positive linear relationship across lags.
Data Preparation
Dealing with outliers and missing values,
v5 <- v5 |>
tsclean(lambda = "auto") |>
na_ma()
ggtsdisplay(v5)

Let's difference the data and see if it achieves stationarity,
checkresiduals(diff(v5))

##
## Ljung-Box test
##
## data: Residuals
## Q* = 61.078, df = 10, p-value = 2.265e-09
##
## Model df: 0. Total lags used: 10
The small Ljung-Box p-value shows that the differenced series still has significant autocorrelation, which the models will need to capture. Now we set aside our training and testing sets.
c(train, test, horizon) %<-% train_test_split(v5, 0.8)

Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(v5 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(v5 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)

## ses holt
## v5 15.45108 19.26787
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)

## [,1]
## holt_pval 0.8696317
## ses_pval 0.8115583
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit

## ETS(M,N,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.9277
##
## Initial states:
## l = 30.4884
##
## sigma: 0.0179
##
## AIC AICc BIC
## 9655.076 9655.095 9670.582
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(v5 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)

## ses holt ets
## v5 15.45108 19.26787 15.45439
checkresiduals(ets.fc)

##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,N)
## Q* = 17.805, df = 10, p-value = 0.05834
##
## Model df: 0. Total lags used: 10
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data") +
ylab("")ARIMA modeling
set.seed(123)
train |>
auto.arima(lambda = 0) |>
forecast(h = horizon) |>
checkresiduals()

##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1) with drift
## Q* = 13.318, df = 8, p-value = 0.1014
##
## Model df: 2. Total lags used: 10
set.seed(123)
arima.fc <- train |>
auto.arima(lambda=0) |>
forecast(h=horizon)

set.seed(123)
arima.acc <- c(v5 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc

## v5
## 40.19997
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")Results
s3v5.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
s3v5.results

## ses holt ets arima
## v5 15.45108 19.26787 15.45439 40.19997
Var07
Exploratory Data Analysis
v7 <- ts(s3$Var07)
ggtsdisplay(v7)

## Warning: Removed 4 rows containing missing values (`geom_point()`).
- The data is non-stationary.
- There is an upward trend over time.
- There is some cyclicity, but no discernible seasonality.
- The ACF plot tails off over time, indicating significance in the autocorrelation and presence of an AR component. The PACF cuts off at lag 2, indicating an AR(2) component.
- There are some missing values.
- There are some outliers present.
gglagplot(v7)

The lag plot indicates a positive linear relationship across lags.
Data Preparation
Dealing with outliers and missing values,
v7 <- v7 |>
tsclean(lambda = "auto") |>
na_ma()
ggtsdisplay(v7)

We've been checking differences manually, but we can also call the
ndiffs function to determine the appropriate number of
first differences,
ndiffs(v7)

## [1] 1
One difference is required to make the v7 data
stationary.
Now we set aside our training and testing sets.
c(train, test, horizon) %<-% train_test_split(v7, 0.8)

Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(v7 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(v7 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)

## ses holt
## v7 15.32257 18.70147
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)

## [,1]
## holt_pval 0.5457403
## ses_pval 0.3956926
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit

## ETS(M,A,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 1e-04
##
## Initial states:
## l = 30.7662
## b = 0.0751
##
## sigma: 0.0168
##
## AIC AICc BIC
## 9495.793 9495.839 9521.635
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(v7 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)

## ses holt ets
## v7 15.32257 18.70147 26.68161
checkresiduals(ets.fc)

##
## Ljung-Box test
##
## data: Residuals from ETS(M,A,N)
## Q* = 18.471, df = 10, p-value = 0.04752
##
## Model df: 0. Total lags used: 10
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data") +
ylab("")ARIMA modeling
set.seed(123)
train |>
auto.arima(lambda = 0) |>
forecast(h = horizon) |>
checkresiduals()##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,0) with drift
## Q* = 18.498, df = 10, p-value = 0.04712
##
## Model df: 0. Total lags used: 10
set.seed(123)
arima.fc <- train |>
auto.arima(lambda=0) |>
forecast(h=horizon)

set.seed(123)
arima.acc <- c(v7 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc

## v7
## 40.03802
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")Results
s3v7.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S03\n\n")## MAPE results by model - S03
rbind(s3v5.results, s3v7.results)

## ses holt ets arima
## v5 15.45108 19.26787 15.45439 40.19997
## v7 15.32257 18.70147 26.68161 40.03802